244 research outputs found
CAMUR: Knowledge extraction from RNA-seq cancer data through equivalent classification rules
Methods for extracting knowledge from Next Generation Sequencing data are in high demand. In this work, we focus on RNA-seq gene expression analysis and specifically on case-control studies with rule-based supervised classification algorithms that build a model able to discriminate cases from controls. State-of-the-art algorithms compute a single classification model that contains few features (genes). In contrast, our goal is to elicit more knowledge by computing many classification models, and therefore to identify most of the genes related to the predicted class.
Predicting long-term publication impact through a combination of early citations and journal impact factor
The ability to predict the long-term impact of a scientific article soon
after its publication is of great value towards accurate assessment of research
performance. In this work we test the hypothesis that good predictions of
long-term citation counts can be obtained through a combination of a
publication's early citations and the impact factor of the hosting journal. The
test is performed on a corpus of 123,128 WoS publications authored by Italian
scientists, using linear regression models. The average accuracy of the
prediction is good for citation time windows above two years, decreases for
lowly-cited publications, and varies across disciplines. As expected, the role
of the impact factor in the combination becomes negligible after only two years
from publication.
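
As a rough illustration of the kind of model the abstract describes, the sketch below fits long-term citations as a linear function of early citations and journal impact factor by ordinary least squares. The function names, the untransformed variables, and the toy numbers in the usage note are assumptions made for illustration only; the abstract does not state the exact model specification used in the study.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_citation_model(early, impact, long_term):
    """Hypothetical helper: least-squares fit of
    long_term ~ b0 + b1 * early + b2 * impact via the normal equations."""
    rows = [[1.0, float(e), float(f)] for e, f in zip(early, impact)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, long_term)) for i in range(3)]
    return solve_linear_system(xtx, xty)


def predict(beta, early, impact):
    """Predicted long-term citation count for one publication."""
    return beta[0] + beta[1] * early + beta[2] * impact
```

Calling `fit_citation_model([0, 2, 5, 1, 8], [1.0, 2.5, 3.0, 0.8, 4.2], y)` on made-up counts `y` returns the intercept and the two slopes. Comparing the fitted impact-factor slope across citation windows is one way to examine how its contribution changes over time.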
Hacking Smart Machines with Smarter Ones: How to Extract Meaningful Data from Machine Learning Classifiers
Machine Learning (ML) algorithms are used to train computers to perform a
variety of complex tasks and improve with experience. Computers learn how to
recognize patterns, make unintended decisions, or react to a dynamic
environment. Certain trained machines may be more effective than others because
they are based on more suitable ML algorithms or because they were trained
through superior training sets. Although ML algorithms are known and publicly
released, training sets may not be reasonably ascertainable and, indeed, may be
guarded as trade secrets. While much research has been performed about the
privacy of the elements of training sets, in this paper we focus our attention
on ML classifiers and on the statistical information that can be unconsciously
or maliciously revealed from them. We show that it is possible to infer
unexpected but useful information from ML classifiers. In particular, we build
a novel meta-classifier and train it to hack other classifiers, obtaining
meaningful information about their training sets. This kind of information
leakage can be exploited, for example, by a vendor to build more effective
classifiers or to simply acquire trade secrets from a competitor's apparatus,
potentially violating its intellectual property rights.
Common operation scheduling with general processing times: A branch-and-cut algorithm to minimize the weighted number of tardy jobs
Common operation scheduling (COS) problems arise in real-world applications, such as industrial processes of material cutting or component dismantling. In COS, distinct jobs may share operations, and when an operation is done, it is done for all the jobs that share it. We here propose a 0-1 LP formulation with exponentially many inequalities to minimize the weighted number of tardy jobs. Separation of inequalities is in NP, provided that an ordinary min Lmax scheduling problem is in P. We develop a branch-and-cut algorithm for two cases: one machine with precedence relation; identical parallel machines with unit operation times. In these cases separation is the constrained maximization of a submodular set function. A previous method is modified to tackle the two cases, and compared to our algorithm. We report on tests conducted on both industrial and artificial instances. For single machine and general processing times the new method definitely outperforms the other, extending in this way the range of COS applications.
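
To make the objective concrete, the following sketch evaluates the weighted number of tardy jobs for a given operation sequence on a single machine, where a job completes once all the operations it shares are done. It then finds the best sequence by brute-force enumeration. This is not the branch-and-cut algorithm of the paper, and it ignores precedence relations; the data layout and the names `weighted_tardy_jobs` and `best_sequence_value` are assumptions made for illustration.

```python
from itertools import permutations


def weighted_tardy_jobs(sequence, proc_time, jobs):
    """Weighted number of tardy jobs for one operation sequence.

    sequence  : operation ids in processing order (single machine)
    proc_time : dict mapping operation id -> processing time
    jobs      : dict mapping job id -> (set of operations, due date, weight)
    """
    finish = {}
    clock = 0
    for op in sequence:
        clock += proc_time[op]
        finish[op] = clock  # a shared operation finishes once, for all its jobs
    total = 0
    for ops, due, weight in jobs.values():
        completion = max(finish[op] for op in ops)
        if completion > due:
            total += weight
    return total


def best_sequence_value(proc_time, jobs):
    """Minimum objective over all operation orders (tiny instances only)."""
    return min(
        weighted_tardy_jobs(seq, proc_time, jobs)
        for seq in permutations(proc_time)
    )
```

Enumeration grows factorially with the number of operations, so it serves only to check small hand-made instances.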
A stochastic estimated version of the Italian dynamic General Equilibrium Model (IGEM)
We estimate with Bayesian techniques the Italian dynamic General Equilibrium Model (IGEM), which has been developed at the Italian Treasury Department, Ministry of Economy and Finance, to assess the effects of alternative policy interventions. We analyze and discuss the estimated effects of various shocks on the Italian economy. Compared to the calibrated version used for policy analysis, we find a lower wage rigidity and higher adjustment costs. The degree of price and wage indexation to past inflation is much smaller than the indexation level assumed in the calibrated model. No substantial difference is found in the estimated monetary parameters. Estimated fiscal multipliers are slightly smaller than those obtained from the calibrated version of the model.
LAF: Logic Alignment Free and its application to bacterial genomes classification
Alignment-free algorithms can be used to estimate the similarity of biological sequences and hence are often applied to the phylogenetic reconstruction of genomes. Most of these algorithms rely on comparing the frequency of all the distinct substrings of fixed length (k-mers) that occur in the analyzed sequences. In this paper, we present Logic Alignment Free (LAF), a method that combines alignment-free techniques and rule-based classification algorithms in order to assign biological samples to their taxa. This method searches for a minimal subset of k-mers whose relative frequencies are used to build classification models as disjunctive-normal-form logic formulas (if-then rules). We apply LAF successfully to the classification of bacterial genomes to their corresponding taxonomy. In particular, we succeed in obtaining reliable classification at different taxonomic levels by extracting a handful of rules, each one based on the frequency of just a few k-mers. State-of-the-art methods to adjust the frequency of k-mers to the character distribution of the underlying genomes have negligible impact on classification performance, suggesting that the signal of each class is strong and that LAF is effective in identifying it.
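
The two ingredients named in the abstract, relative k-mer frequencies and disjunctive-normal-form rules over them, can be sketched as follows. This is an illustrative reading of the abstract, not the LAF implementation: the function names, the rule encoding, and the thresholds used in any example are assumptions.

```python
from collections import Counter


def kmer_relative_frequencies(sequence, k):
    """Relative frequency of every k-mer (overlapping windows) in a sequence."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def matches_dnf(freqs, clauses):
    """Evaluate a rule in disjunctive normal form.

    clauses is a list of conjunctions; each conjunction is a list of
    (kmer, threshold, at_least) conditions. A condition holds when
    (frequency >= threshold) equals at_least. The rule fires if every
    condition of at least one conjunction holds.
    """
    return any(
        all((freqs.get(kmer, 0.0) >= threshold) == at_least
            for kmer, threshold, at_least in clause)
        for clause in clauses
    )
```

For instance, a hypothetical rule `[[("AC", 0.3, True), ("GG", 0.1, False)]]` reads "if the frequency of AC is at least 0.3 and the frequency of GG is below 0.1, assign the sample to the class".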